Multimodal representation learning for predicting molecule–disease relations

Abstract

Motivation: Predicting molecule–disease indications and side effects is important for drug development and pharmacovigilance. Comprehensively mining molecule–molecule, molecule–disease and disease–disease semantic dependencies can potentially improve prediction performance.

Methods: We introduce a Multi-Modal REpresentation Mapping Approach to Predicting molecule–disease relations (M2REMAP) that incorporates clinical semantics learned from the electronic health records (EHR) of 12.6 million patients. Specifically, M2REMAP first learns a multimodal molecule representation that synthesizes chemical property and clinical semantic information by mapping molecular chemical structures, via a deep neural network, onto the clinical semantic embedding space shared by drugs, diseases and other common clinical concepts. M2REMAP then combines the multimodal molecule representation with the disease semantic embedding to jointly infer indications and side effects.

Results: We extensively evaluate M2REMAP on molecule indications, side effects and interactions. Results show that incorporating EHR embeddings improves performance significantly, for example improving over the baseline models by 23.6% in PRC-AUC on indications and 23.9% on side effects. Further, M2REMAP overcomes a limitation of existing methods and effectively predicts drugs for novel diseases and emerging pathogens.

Availability and implementation: The code is available at https://github.com/celehs/M2REMAP, and prediction results are provided at https://shiny.parse-health.org/drugs-diseases-dev/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Table 1: Comparison of M2REMAP with different feature extractors, evaluated in ROC-AUC, for drug-indication prediction on PrimeKG [3] and side-effect prediction on SIDER (Zhang) [4,5].

Method            PrimeKG   SIDER 4.1
Transformer [1]   0.853     0.886
MPNN [2]          0.860     0.907
CNN+Bi-GRUs       0.882     0.901

Comparison of Feature Extractors
We compared several feature extractors: the proposed CNN+Bi-GRUs and a Transformer [1], both of which take molecular SMILES strings as input, and an MPNN [2], which operates on molecular graphs. The results in Table 1 show that the proposed feature extractor achieves the best performance on PrimeKG [3] for predicting drug indications and performs comparably to MPNN on SIDER (Zhang) [4,5] for predicting drug side effects. For consistency, we use the proposed feature extractor across all experiments.

Figure 1: We first learn a new set of EHR embeddings using 4CE data, which include COVID-19 concepts. Then, we transform the 4CE embeddings to VA CDW embeddings and infer the relations between COVID-19 and Drugbank molecules.

Supplementary Note: Molecules for COVID-19
The pipeline to predict potential molecules for COVID-19 is illustrated in Figure 1. We first obtain a set of 200-dimensional embeddings using the EHR data from the Consortium for Clinical Characterization of COVID-19 by EHR (4CE) Phase 2.2 [6], which includes EHR data of COVID-19 patients from more than 200 hospitals in 8 countries. Since the 4CE data are COVID-specific and contain only a small group of concepts and drugs, we map the 4CE embeddings to the VA CDW embeddings for relation inference. There are 2105 diagnostic concepts shared between the 4CE and VA CDW EHR data. We train a multi-layer perceptron (MLP) to learn the mapping from the 4CE embedding to the VA embedding via supervised regression, using the 2105 shared codes as labels and the mean squared error as the training objective. The MLP regression network consists of two layers, namely a hidden layer with 200 units and an output layer with 100 units. In the 4CE data, there are 5 concepts related to COVID-19, namely "PCR positive", "PCR negative", "U07.1", "COVID viral" and "COVID vaccine". Among them, we empirically find that "PCR positive" works best and thus represent COVID-19 using this concept. We use the M2REMAP model trained on the annotated drug indications from PrimeKG [3] to predict the relations between COVID-19 and all Drugbank molecules [7].

Figure 3 (caption, partial): In (b) we visualize several typical diseases represented using Phecodes and the molecules that are literature-validated to be cancer-therapeutic.
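The embedding-mapping step described above can be sketched as a small regression network. This is a minimal illustration rather than the authors' implementation: the layer sizes (200-dimensional input, 200 hidden units, 100 outputs) and the mean squared error objective come from the text, while the ReLU activation, the weight initialization, full-batch gradient descent, the learning rate, and the function names `init_mlp`, `forward`, and `train` are assumptions. Random arrays stand in for the 4CE and VA CDW embeddings of the 2105 shared codes.

```python
import numpy as np


def init_mlp(in_dim=200, hidden_dim=200, out_dim=100, seed=0):
    """Two-layer MLP: 200-d input -> 200 hidden units -> 100 outputs."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, np.sqrt(1.0 / hidden_dim), (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def forward(params, x):
    """Return hidden activations and predicted target-space embeddings."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU is an assumption
    return h, h @ params["W2"] + params["b2"]


def train(params, x, y, lr=0.1, epochs=20):
    """Full-batch gradient descent on the mean squared error (assumed optimizer)."""
    losses = []
    for _ in range(epochs):
        h, pred = forward(params, x)
        losses.append(float(np.mean((pred - y) ** 2)))
        g_pred = 2.0 * (pred - y) / pred.size   # gradient of the MSE w.r.t. pred
        g_w2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_h = g_pred @ params["W2"].T
        g_h[h <= 0.0] = 0.0                     # back-propagate through the ReLU
        g_w1 = x.T @ g_h
        g_b1 = g_h.sum(axis=0)
        for name, grad in (("W1", g_w1), ("b1", g_b1), ("W2", g_w2), ("b2", g_b2)):
            params[name] -= lr * grad
    return losses


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Placeholder data: one row per shared code (2105 in the text).
    source_emb = rng.normal(size=(2105, 200))   # stands in for 4CE embeddings
    target_emb = rng.normal(size=(2105, 100))   # stands in for VA CDW embeddings
    params = init_mlp()
    losses = train(params, source_emb, target_emb)
    print("first and last training loss:", losses[0], losses[-1])
```

Once trained, applying `forward` to the source-space vector of a concept that has no counterpart in the target space (for example the "PCR positive" concept) yields a 100-dimensional vector that can be compared with the other target-space embeddings.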

Supplementary Note: Embedding Visualization of Novel Molecules
We visualize all EHR concepts and Drugbank molecules to show that M2REMAP successfully maps the chemical structures of novel molecules to the EHR embedding space using the deep neural network. As shown in Figure 3 (a), the Drugbank molecules largely follow the same embedding distribution as the clinical concepts. This enables M2REMAP to generalize to novel molecules and infer their relations with EHR diseases. In Figure 3 (b), we visualize the molecules that are predicted to be therapeutic for cancers and are validated via literature review, together with several representative diseases represented by diagnosis codes and a random 2% sample of the Drugbank molecules. The results are consistent with the observation that molecules and their related indications tend to be close in the embedding space. For example, Vanoxerine (DB03701) is close to liver cancer and chronic hepatitis and is shown to treat hepatocellular carcinoma in [8].

Supplementary Note: Sampling of Negative Drug-Disease Relations
We select negative molecule-disease relations according to EHR embedding similarity. For each molecule, we require the selected negative side effects or indications to be dissimilar to all of its reported ones. Different similarity thresholds are used for indications and side effects: 0.5 for indications and 0.2 for side effects.
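The filtering rule above can be sketched as follows. The text does not name the similarity measure or state how many negatives are drawn, so cosine similarity, the "strictly below the threshold" reading of "dissimilar", the sample size `k`, and the function names are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def candidate_negatives(reported, disease_emb, threshold):
    """Diseases whose similarity to every reported disease is below threshold."""
    candidates = []
    for disease, vec in disease_emb.items():
        if disease in reported:
            continue
        if all(cosine(vec, disease_emb[r]) < threshold for r in reported):
            candidates.append(disease)
    return sorted(candidates)


def sample_negatives(reported, disease_emb, threshold, k, seed=0):
    """Draw up to k negatives for one molecule from the filtered candidates."""
    pool = candidate_negatives(reported, disease_emb, threshold)
    return random.Random(seed).sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    # Toy 2-d embeddings; the identifiers are placeholders, not real codes.
    emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
    # 0.5 is the indication threshold given in the text; 0.2 is used for side effects.
    print(sample_negatives({"a"}, emb, threshold=0.5, k=2))
```

A lower threshold, as used for side effects, leaves a smaller and more conservative candidate pool, since a disease must be less similar to every reported one before it can be sampled as a negative.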

Supplementary Note: Training Algorithm
The training details of the deep neural network to learn molecule-disease relations are provided in Algorithm 1.

Supplementary Note: Analysis of Improvements from EHR Semantic Embedding
We study how EHR semantic embedding vectors improve drug-disease predictions. We visualize the performance gain in PRC-AUC, namely $P_{\mathrm{gain}} = P_{\mathrm{full}} / (P_{\mathrm{full}} - P_{\mathrm{base}})$, where $P_{\mathrm{full}}$ denotes the PRC-AUC of the full M2REMAP model that exploits semantic embedding vectors and $P_{\mathrm{base}}$ is the PRC-AUC of the baseline model without EHR semantics. We visualize the performance gain by drug and disease groups. For drugs, we map them to RxCUI and obtain their hierarchy. Each drug is mapped to the corresponding LEVEL1 concept, which consists of 14 groups such as "sensory organs" and "respiratory system". For indications and side effects, we map the CUIs to ICD-10-CM, which includes 21 topics such as "neoplasms" and "nervous-system diseases". For each drug/disease group, we report the average PRC-AUC gain after introducing the EHR semantic embedding.
In Figure 6, we show the PRC-AUC gains in drug indication prediction obtained by 10-fold cross-validation on PrimeKG. Groups with fewer than 5 observed drugs/diseases are merged into an extra "others" group. 10 drug groups benefit from the introduction of the EHR embedding, and the top 3 are "musculo-skeletal system", "respiratory system" and "nervous system". The 2 groups that suffer performance drops are "dermatologicals" and "others". Among the 10 indication disease groups observed, 8 benefit from the EHR embedding, most notably "nervous system diseases" and "metabolic diseases". 3 disease groups suffer performance drops, namely "circulatory system diseases", "others" and "infections and parasitic diseases".
In Figure 7, we show the PRC-AUC gains in drug side effect prediction on SIDER (Zhang) and report the results for the test drugs. Drug groups with fewer than 5 drugs and side effect groups with fewer than 10 side effects are merged into an extra "others" group. The 11 drug groups observed all benefit from the EHR embedding, and the improvements for "respiratory system" and "dermatologicals" are the most significant. Among the 10 groups of side effects observed, 9 benefit from the EHR embedding. The top group is "eye/ear diseases", followed by "mental disorders" and "blood/immune diseases". Only the "others" group suffers a slight performance drop.
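The group-wise aggregation used for Figures 6 and 7 can be sketched as below. The sketch assumes that a per-drug (or per-disease) gain has already been computed and only illustrates the grouping: groups with fewer members than a minimum count are merged into "others", and gains are then averaged per group. The function name, argument names, and toy values are hypothetical.

```python
from collections import Counter, defaultdict


def average_gain_by_group(item_gain, item_group, min_count, other_label="others"):
    """Average per-item gains by group, merging small groups into `other_label`.

    item_gain  : dict mapping item id -> precomputed PRC-AUC gain
    item_group : dict mapping item id -> group name
    min_count  : groups with fewer than this many items are merged
    """
    sizes = Counter(item_group[item] for item in item_gain)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item, gain in item_gain.items():
        group = item_group[item]
        if sizes[group] < min_count:
            group = other_label
        totals[group] += gain
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


if __name__ == "__main__":
    # Toy values for illustration only; not results from the paper.
    gains = {"d1": 0.2, "d2": 0.4, "d3": -0.1, "d4": 0.3}
    groups = {
        "d1": "respiratory system",
        "d2": "respiratory system",
        "d3": "dermatologicals",
        "d4": "sensory organs",
    }
    for group, mean_gain in average_gain_by_group(gains, groups, min_count=2).items():
        print(group, round(mean_gain, 3))
```

With the thresholds given in the text, `min_count` would be 5 for drug groups and for indication disease groups, and 10 for side effect groups.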

Supplementary Note: Sensitivity Analysis on the Dimensionality of Semantic Embedding
We perform a sensitivity analysis on the dimensionality of the embedding vectors. As shown in Table 2, performance is comparable between the 50-dimensional and 100-dimensional EHR embedding vectors but degrades as the dimensionality increases to 300 or 500. For consistency, we use 100-dimensional embedding vectors across all experiments.